AD-A095  025  TEXAS  UNIV  AT  AUSTIN  CENTER  FOR  CYBERNETIC  STUDIES  p/f  l*/l 

UNCONSTRAINED  MINIMIZATION  BY  INTERPOLATIONS  RATES  OF  CONVERSE)#— CTCIU) 
DEC  80  J  BAR2ILAI  n0001*-60-C-0**2 

UNCLASSIFIED  CCS-389  NL 


Research  Report  CCS  389 

UNCONSTRAINED  MINIMIZATION  BY 
INTERPOLATION:  RATES 
OF  CONVERGENCE 

by 

J.  Barzilai 


December  1980 


This  research  was  partly  supported  by  Project  NR047-021,  ONR 
Contract  N00014-80-C-0242  with  the  Center  for  Cybernetic 
Studies,  The  University  of  Texas  at  Austin.  Reproduction  in 
whole  or  in  part  is  permitted  for  any  purpose  of  the  United 
States  Government. 


CENTER  FOR  CYBERNETIC  STUDIES 


A.  Charnes,  Director 
Business-Economics  Building,  203E 
The  University  of  Texas  at  Austin 
Austin,  TX  78712 
(512)  471-1821 


A-  .-w  • 


ABSTRACT 


Newton's  method  for  finding  a  stationary  point  of  f:Rn-<-R  consists 

2  -I 

of  the  iteration  =  x.  -  [V  f(x^)]  '  Vf  (x.).  Its  main  attraction 

is  its  quadratic  convergence.  However,  it  necessitates  computation  and 
inversion  of  the  second  order  derivative  matrix. 

Common  minimization  algorithms  approximate  the  Hessian  or  its  inverse 
by  first  order  (i.e.,  gradient)  information.  First  order  information 
algorithms  in  common  use  are  known  to  have  at  best  superl inear  rate  of 
convergence. 

We  analyze  the  rate  of  convergence  of  a  new  class  of  algorithms 
based  on  n-dimensional  interpolation.  In  particular,  we  present  a  class 
of  algorithms  which  use  first  order  information  ^nly,  while  maintaining 
quadratic  convergence. 
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1.  Introduction 


Most  of  the  commonly  used  algorithms  for  the  unconstrained  minimiza¬ 
tion  of  f:Rn+R  for  n>l,  are  descent  methods.  These  are  based  on  the  iteration 
xi+l  =  xi  +  aic*i’  w^ere  d.jtRn  is  3  search  direction,  and  c^cR  is  the  step- 
size,  usually  determined  by  a  line  search  as  a  minimizer  of  f(x..  +  ad^ ) 
over  a  0. 

The  exception  is  Newton's  method,  which  is  based  on  interpolating 
f  by  a  quadratic  and  minimizing  this  quadratic  at  each  step  of  the  algorithm. 

For  a  general  discussion  of  unconstrained  minimization  techniques 
see  [1,8].  The  following  points  are  relevant  to  our  discussion. 

The  classical  steepest  descent  method  uses  first  order  (i.e., 
gradient)  information  only.  Its  main  drawback  is  its  linear  rate  of 
convergence. 

Newton's  method  converges  quadratically.  However,  it  necessitates 
the  costly  computation  and  inversion  of  the  Hessian  at  each  iteration. 

Other  methods  based  on  first  order  information  are  known  to  converge 
at  best  superl inearly  (e.g.,  [6]). 

Many  of  these  methods  approximate  Newton's  method  in  the  sense  that 
the  search  direction  they  generate  can  be  shown  to  be  a  direction  along 
which  an  appropriate  qua  iratic  is  minimized. 

A  different  approach  is  based  on  the  unfounded  assumption  that 
algorithms  having  the  finite  termination  property  (i.e.,  solution  in  a 
finite  number  of  steps)  for  a  class  of  functions  wider  than  the  class  of 
quadratics,  are  faster  than  those  having  the  quadratic  termination  property. 
Thus,  Jacobson  and  Oksman  [7]  generalize  from  quadratic  termination  to  homo¬ 
geneous  functions  termination.  This  was  further  generalized  (see  e.g.,  [4]). 
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Another  step  toward  discarding  the  quadratic  model  has  recently 
been  taken  by  Davidon  [5].  His  motivation,  however,  partly  coincides  with 
ours. 

In  this  paper,  we  analyze  the  rate  of  convergence  of  n-dimensional 
interpolation  algorithms  for  unconstrained  minimization.  We  note  the 
following. 

Most  of  the  commonly  used  algorithms  for  one-dimensional  minimiza¬ 
tion  are  based  on  polynomial  interpolation.  It  is  well  known  that  Newton's 
method  in  this  case  is  inefficient  in  the  sense  that  quadratic  convergence 
can  be  achieved  using  first  order  information  only  (e.g.,  [8,  p.  142]),  by 
using  two  interpolation  points  rather  than  one.  This  should  make  one  doubt 
whether  Newton's  method  is  a  suitable  model  for  efficient  algorithms. 

Quadratics  are  inadequate  for  n-dimensional  two-point  interpolation 
with  zero  and  first  order  information  (see  Davidon  [5]).  Therefore,  non¬ 
polynomial  interpolation  is  necessary.  Our  analysis  shows  that  the  rate  of 
convergence  is  independent  of  the  interpolating  function.  For  interpolatory 
algorithms,  therefore,  the  question  whether  the  search  direction  coincides 
with  Newton's  direction  or  generalizes  it,  is  irrelevant  to  the  rate  of 
convergence  analysis.  The  same  is  true  for  termination  properties. 

The  main  difficulty  in  the  analysis  of  n-dimensional  interpolation 
algorithms  is  that  the  formulas  for  the  error  in  n-dimensional  interpolation 
are  not  suitable  for  this  purpose.  We  overcome  this  difficulty  by  relating 
the  n-dimensional  problem  to  an  appropriate  one-dimensional  interpolation 
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2.  Minimization  by  Interpolation 

The  interpolation  algorithm  we  study  generates  a  sequence  i x .  >  as 
follows.  Let  sjil,  in  >0  be  fixed  integers,  and  let  T:Rn»-R  depend  on 
r  =  s(m+l)  parameters.  Given  m+1  approximants  xq,-..,x  +j  to  the  solution 
of 


(1)  Vf(x*)  =  0, 


we  use  x.,  x.  x.  to  construct  a  new  approximant  x.^,.  First  we 

li-l  i-m  rr  l+l 

interpolate  f  by  T  requiring 

(2)  T(k)  (x._J )  =  f(k)  j=0, . . . ,m;  k=0,...,s-L. 

f  1 )  (2)  2 

Here  f  '  -  Vf,  f  -  V  f  etc.  The  new  point  x^+j  is  determined  by 

(3)  VT(xi+1)  =  0. 

In  the  following,  we  assume  that  equations  (1)~(3)  have  solutions. 
We  define  the  rate  (or  order)  of  convergence  of  a  sequence  {x^l 
converging  to  x*  as  the  number  p  (if  it  exists)  such  that 


II  *i  +  1  '  x*  II 

IK-  -  x*  ||  P 


C  /  0. 


Here  ||  '  ||  is  a  fixed  arbitrary  norm.  Ortega  and  Rheinboldt  [9,  §9]  refer 
to  the  rate  p  defined  above  as  the  C-order  of  the  sequence  (x^l.  When  it 
exists,  it  coincides  with  their  Q-  and  R-  orders.  We  will  unify  our  results 
for  the  C-,  Q-  and  R-  orders  through  the  use  of  the  C-order  of  convergence. 

We  derive  the  rate  of  convergence  of  the  n-dimensional  interpolating 
algorithm  by  establishing  some  difference  relations  for  the  errors  jjx.-  x*l| 
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To  derive  the  basic  difference  relation  we  need, we  pass  a  curve  in  Rn 
through  the  points  x*  and  x^  +  j,  x._m>  i.e.,  we  determine  a  function 

^ :R+Rn  such  that 

'■^(t^_j)_x^_.  j  -  -1,  0,  1,  ....  m, 

(4) 

^(t*)  =  x*, 

where  the  parameter  t  is  chosen  so  that 

(5)  =  ||  x._j  -  x*  ||  ,  t*  =  ||  x*  -  x*  ||  =  0. 

We  will  later  discuss  this  construction.  Note,  however,  that  the 
construction  of  ip  is  a  part  of  the  analysis  of  the  properties  of  the 
algorithm,  not  a  part  of  the  algorithm  itself. 

Now  define  0 ( t )  =  T(i|/(t)),  4> ( t )  -  f(t(t)).  Equations  (2)  and  (4) 

imply 

(6)  0(k)  (tt_j)  =  .t(k)  ,  j  =  0 . m;  k  =  0,  ...,s-l. 

it  follows  that  (6)  defines  one-dimensional  interpolation  for  which 

convenient  error  formula  exists  (see  Ostrowski  [10,  p.  12]).  Henceforth, 
r+1 

we  assume  0,<pe C  in  a  neighborhood  of  t*  =  0.  Using  the  one-dimensional 

error  formula  we  have 

(7)  *(t) .  0(t)  =  ^■■(r)u)T-  -ft-  (t _  ti_j)s , 

j=0 

where  ip  is  a  point  in  the  interval  determined  by  t,  t^ ,  ...,  t_._m.  Note  that 
formula  (7)  holds  for  general  (not  necessarily  polynomial)  interpolation. 
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We  now  differentiate  (7)  and  set  t  =  0.  From  (1)  and  (2)  we  have 
<j>'(o)  =  0'(t.+j)  =  0,  so  that 

<J>'(o)  -  e'(o)  =  -9'(o)  =  6'(t.+j)  -  9 1  (o)  =  t.+^  0 "  ( r, ) ,  where  ;  is  a  point 
between  t.+1  and  0.  Differentiating  the  right  hand  side  of  (7)  using 
Ralston's  result  [11]  on  the  differentiation  of  the  error  term  generalized 
for  the  hyperosculatory  case  (see  [2]),  we  finally  have 

Lemma  1  Under  the  assumptions  made  above,  the  errors  in  the  n-dimensional 
interpolation  algorithm  satisfy  the  difference  relation 

mm  m 

(8)  t..,  =  m.  y  ts_j  -tT  ts  .  +  N.  TP  tS  ,  , 

v  l+l  l  L-t  l-k  1 I  i-j  i  1 1  i-j 

k=0  j=0  j=0 

jfk 


where 


M. 

i 


M(r*l(ti+1))(r-3)-s 


N. 


i 


N(n.(ti+1))-(-l) 


r 


* 


«<o  ■ 


and  where  (t),  n..(t)  are  in  the  interval  determined  by  t,  t.+j, 
t^_m>  and  c(t^+^)  is  in  the  interval  determined  by  t-+j,0.  Q 

It  follows  from  (8)  that  if  the  initial  errors  tg,  tp  are  small 
enough,  and  if  the  coefficients  {M^},  {N^}  are  bounded,  the  sequence  {t-1 
converges  to  zero,  i.e.,  x.+x*.  Moreover,  if  s  -2,  (8)  implies 


(9) 


(i.e.,  super-linear  convergence).  If  s=l,  we  assume  m>2.  For  m=2 ,  (8) 
is  the  basic  difference  relation  governing  the  behavior  of  the  Quadratic 
Fit  algorithm,  which  is  known  to  converge  superl inearly  (see  Theorem  3.4.1 
in  Brent  [3]).  It  is  evident  from  (8)  that  the  rate  of  m>2  is  not  less  than 
the  rate  for  m=2.  Therefore,  (9)  holds  for  all  s>l,  m^>0  if  r=s(m+l)>3. 
Rewriting  (8)  in  the  form 


we  finally  have 


Lemma  2  Under  the  assumptions  made  above,  and  if  the  sequence 

t^  satisfies  the  difference  relation 


<w>  wwr1 


m 

IT 

j=l 


llj 


Defining  =  log|tJ  and  B..  =  log|A..j,  (10)  implies  the  difference 

equation 

m 

yi+l  -  (S'I)  yi  -  SJ?1  Vj  '  *1*1 

with  indicial  equation 

(11)  tm+1  -  (s-1)  tm  -  sV  tj  =  0, 

J=0 


where  the  sum  in  (11)  is  taken  as  zero  if  m=0. 
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Tamir  [12,13]  proves  that  under  our  assumptions  the  C-rate  of 
convergence  of  the  sequence  { t ^ }  (hence  {x..})  is  given  by  the  unique  positive 
root  of  the  indicia!  equation  (11).  In  this  case,  the  C-,  Q- ,  and  R- 
rates  of  convergence  are  exactly  p,  where  p  is  the  positive  solution  of  (11). 

If  the  limit  of  NT  exists  and  is  zero,  or  if  this  limit  does  not 
exist,  but  the  sequence  { M_j }  is  bounded,  equation  (8)  implies  that  the  Q- 
and  R-  rates  of  convergence  are  still  at  least  p. 

We  summarize  our  results: 


Theorem:  Under  the  above  assumptions,  if  the  sequence  {M..}  is  bounded, 

and  if  the  initial  errors  of  the  interpolation  algorithm  are  small  enough, 
the  sequence  { x^. }  converges  to  the  solution  x*  with  C-  (when  it  exists), 

Q-  and  R-  rates  of  convergence  at  least  p,  where  p  is  the  unique  positive 
solution  of  (11). 

Corollary:  The  rate  of  convergence  of  the  sequence  generated  by  the 

interpolation  algorithm  is  independent  of  the  interpolating  function. 


Remark:  In  order  for  the  sequence  { }  to  be  bounded,  it  is  sufficient 

that  e^r+1^  and  <f/r+^  exist  and  are  continuous,  and  <J>"(0)  f  0.  This  would 
be  the  case  if  f  has  continuous  derivatives  of  order  r+1,  the  parameters 
of  T  depend  continuously  on  the  data  through  (2),  T  has  continuous 


derivatives  of  order  r+1  for  the  appropriate  values  of  the  parameters, 

and  <j>"(0)  =  ip(°)T  V2  f(x*)  ^(0)^0.  Setting  ^.(t)  =  £  aiL-t^  (k=l,...,n), 

K  j=0  JK 

f  ( x*  1 

and  assuming  - — ^ — L  f  0,  the  choice  a ^  =  1  and  a^  =  1  for  k=2,...,n 
X1 


will  ensure  <j>"(0)  f  0. 
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Summary  and  Conclusion 

We  have  shown  that  the  rate  of  convergence  of  n-dimensional 
interpolation  algorithms  is  inherited  from  the  underlying  one-dimensional 
interpolation,  that  it  is  independent  of  the  interpolating  function,  and 
is  given  by  the  unique  solution  of  the  equation 

(12)  tm+1  -  (s-1)  tm  -  s£  tj  =  0  , 

j=0 

where  m+1  interpolation  points  and  s  derivatives  (of  orders  zero  to  s-1) 
are  used. 

Our  work  is  based  on  the  results  of  Traub  [14]  and  Ostrowski  [10] 
for  the  one-dimensional  root-finding  problem.  Tamir  [12,13]  adapted  these 
results  for  the  minimization  problem.  In  [12]  he  studies  the  rate  of 
convergence  of  algorithms  using  function  values  only  (m=0)  with  a  superfluous 
assumption  and  a  false  conjecture.  This  detailed  analysis  is  repeated  in 

[13]  for  the  case  m>0.  He  treats  polynomial  interpolation  only,  and  shows 
that  for  fixed  s  and  m-*«,  the  rate  p  tends  to 


However,  he  neglects  to  realize  the  effect  of  memory  on  the  rate  of 
convergence,  which  is  implied  by  (11)  and  (12). 

Indeed,  for  fixed  s,  the  rate  is  obtained  for  m=0  and  m=l  by  solving 

O 

the  indicial  equations  t-(s-l)=0  and  t  -(s-l)t-s=0,  respectively.  Therefore, 

.2 


p=s-l  for  m=0,  p=s  for  m=l  and  p>  |+\/(U  +1  for  m-««.  It  follows  that 
algorithms  using  more  than  two  interpolation  points  are  inefficient,  and 
two-point  algorithms  are  substantially  faster  than  one-point  algorithms. 
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In  particular  for  m=l  and  s=2  we  have  a  two-point  algorithm  using 

first-order  information  with  second-order  rate  of  convergence  (which  is  a 

well-known  result  in  the  one-dimensional  case). 

Note  that  no  line  search  is  needed  in  this  class  of  algorithms,  and 

that  they  may  be  designed  to  locate  saddle  points  rather  than  minimum 

points.  A  line  search,  however,  may  serve  as  part  of  a  globalizing  procedure. 

Compare  also  the  discussion  in  Davidon  [5]  regarding  the  difficulty 

of  determining  the  effect  of  memory  on  the  performance  of  descent  algorithms. 

We  have  not  computed  the  asymptotic  error  constant,  since  it  depends 

on  the  norm  used  (see  Ortega  and  Rheinboldt  [9]).  This  can  be  computed, 

however,  under  the  appropriate  assumptions  (cf.  Tamir  [12,13]).  We  have 

also  made  no  attempt  at  giving  the  strongest  results  (i.e.,  the  weakest 

assumptions)  possible.  Compare,  for  example,  Brent  [3]. 

Finally,  we  remark  that  the  interpolation  requirements  (2)  leave  much 

freedom  for  the  choice  of  the  interpolation  function  T.  A  useful  choice  for 

n 

it  seems  to  be  a  separable  sum  of  rational  functions  T(s)  =  ^  (x. ).  For 

k=l  K 

? 

example,  for  s=2,  m=l,  one  can  choose  R(x)  =  — —  ’  with  the  values  of 
components  of  T  equal  to  one  another  at  each  interpolation  point. 
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